DAY 3 - Store Vector to DB

2024 iThome Ironman Contest

DAY 22

DevOps

Integrating AI Code Review into CI/CD series, part 22

16th Ironman Contest

嗷嗷嗷

2024-09-05 17:45:14

Now that we have converted our data into vectors, the next step is to store them in a database.

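Because of the nature of this data, vectors are a poor fit for a traditional database, so we need a vector database to get acceptable performance and efficiency. A vector database does not look records up by matching rows, columns, or text; it finds them by similarity. The core operation is Nearest Neighbor Search (NNS). Implementations vary: similarity can be measured with a metric such as cosine similarity, and the vectors can be grouped with a clustering method such as K-Means so that a query only scans the closest clusters. Each choice has its own performance and efficiency trade-offs. Because some of these methods give up exactness for speed, they are also called Approximate Nearest Neighbors (ANN) search.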
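To make "finding by similarity" concrete, here is a minimal sketch of exact nearest neighbor search using cosine distance. The three-dimensional vectors and their names are made up for illustration; real embeddings have hundreds of dimensions, and a vector database would avoid scanning every vector like this.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_neighbors(query, vectors, k=2):
    """Exact (brute-force) search: score every stored vector, keep the k closest."""
    scored = [(cosine_distance(query, vec), vec_id) for vec_id, vec in vectors.items()]
    scored.sort()
    return [vec_id for _, vec_id in scored[:k]]

# Hypothetical 3-dimensional embeddings, invented for illustration only.
vectors = {
    "pineapple": [0.9, 0.1, 0.0],
    "orange": [0.7, 0.3, 0.1],
    "keyboard": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]
print(nearest_neighbors(query, vectors, k=2))
```

The cost grows with the number of stored vectors, which is the reason approximate methods exist.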

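As mentioned earlier, every object has a very large number of feature dimensions, so the stored data adds up quickly. This is where PQ (Product Quantization) comes in: it can shrink the storage footprint dramatically.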
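A rough sketch of the PQ idea: cut each vector into sub-vectors and store, for each one, only the index of its nearest centroid. The codebooks below are hand-written assumptions (real systems learn them, typically with k-means), and the 384-dimension / 48 sub-vector figures at the end are just an assumed configuration for the arithmetic.

```python
def encode(vector, codebooks):
    """Replace each sub-vector with the index of its nearest centroid."""
    sub_len = len(vector) // len(codebooks)
    codes = []
    for i, centroids in enumerate(codebooks):
        sub = vector[i * sub_len:(i + 1) * sub_len]
        # Squared Euclidean distance to every centroid of this sub-space.
        dists = [sum((s - c) ** 2 for s, c in zip(sub, cen)) for cen in centroids]
        codes.append(dists.index(min(dists)))
    return codes

def decode(codes, codebooks):
    """Approximate reconstruction: concatenate the chosen centroids."""
    out = []
    for code, centroids in zip(codes, codebooks):
        out.extend(centroids[code])
    return out

# Hypothetical codebooks: 2 sub-spaces with 2 centroids each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
vector = [0.9, 0.8, 0.1, 0.9]
codes = encode(vector, codebooks)
print(codes, decode(codes, codebooks))

# Storage arithmetic under assumed settings: 384 float32 values per vector
# versus 48 sub-vectors stored as 1-byte codes.
raw_bytes = 384 * 4
pq_bytes = 48 * 1
print(raw_bytes, pq_bytes, raw_bytes // pq_bytes)
```

The decoded vector is only close to the original, not equal to it, which is the precision given up in exchange for the smaller footprint.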

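With this much data, search efficiency also needs help, which is where improved search structures such as HNSW (Hierarchical Navigable Small World) come in. Combining these ideas is essentially how modern vector databases are implemented.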
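HNSW itself is too large to show here, but its core move can be sketched: walk a graph of vectors, always stepping to the neighbor closest to the query. The example below is a simplified single-layer version, not the full algorithm (HNSW stacks several such layers and keeps a candidate list). The points, links, and function names are all made up for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_search(query, points, neighbors, entry):
    """Walk the graph, always moving to the neighbor closest to the query.

    Stops when no neighbor is closer than the current node, so the result
    is a local optimum and therefore only approximate in general.
    """
    current = entry
    current_dist = squared_distance(query, points[current])
    while True:
        best, best_dist = current, current_dist
        for candidate in neighbors[current]:
            dist = squared_distance(query, points[candidate])
            if dist < best_dist:
                best, best_dist = candidate, dist
        if best == current:
            return current
        current, current_dist = best, best_dist

# Hypothetical points on a line; "a" has one long-range link to "d",
# playing the role of the shortcuts that HNSW's upper layers provide.
points = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [2.0, 0.0], "d": [3.0, 0.0], "e": [4.0, 0.0]}
neighbors = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}

print(greedy_search([3.2, 0.0], points, neighbors, entry="a"))
```

The search reaches its answer by following links instead of comparing the query against every stored point.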

Each database has its own strengths; see https://blog.darkthread.net/blog/vector-db-survey/ for a survey. For this exercise we will practice with Chroma, which is lightweight.

pip install chromadb

collection → stores embeddings, documents, and metadata. Embeddings: when you add documents, Chroma automatically creates an embedding for each one.

Let's look at an example from the official documentation.

import chromadb
chroma_client = chromadb.Client()

collection = chroma_client.create_collection(name="my_collection")
collection.add(
    documents=[
        "This is a document about pineapple",
        "This is a document about oranges"
    ],
    ids=["id1", "id2"]
)

results = collection.query(
    query_texts=["This is a query document about hawaii"], # Chroma will embed this for you
    n_results=2 # how many results to return
)
print(results)

The first time you run it, Chroma downloads an embedding model to convert text into vectors. "onnx.tar.gz" is the ONNX (Open Neural Network Exchange) build of that model, a format that can speed up inference.

/Users/xxx/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|█| 79.3M/79.3M [01:19<00

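The result contains the two closest matches, ['id1', 'id2'] (with only two documents stored and n_results=2, both come back). distances measures how far each document is from the query: the smaller the value, the more similar the document.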

{'ids': [['id1', 'id2']], 'distances': [[1.0403728485107422, 1.2430635690689087]], 'metadatas': [[None, None]], 'embeddings': None, 'documents': [['This is a document about pineapple', 'This is a document about oranges']], 'uris': None, 'data': None, 'included': ['metadatas', 'documents', 'distances']}
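The nesting in that output is easy to trip over: each field holds one inner list per query text. A small sketch of how the result can be unpacked, using a dict copied from the output above and trimmed to the fields used here so that it runs without Chroma:

```python
# Copied from the query output shown above (only the fields used below).
results = {
    "ids": [["id1", "id2"]],
    "distances": [[1.0403728485107422, 1.2430635690689087]],
    "documents": [[
        "This is a document about pineapple",
        "This is a document about oranges",
    ]],
}

# Index 0 selects the results for our first (and only) query text.
hits = list(zip(results["ids"][0], results["distances"][0], results["documents"][0]))
for doc_id, distance, document in hits:
    print(f"{doc_id}  distance={distance:.4f}  {document}")

# The smallest distance is the best match.
best_id = min(hits, key=lambda hit: hit[1])[0]
print("closest:", best_id)
```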

I wanted to see the embeddings themselves, so I added the include parameter.

results = collection.query(
    query_texts=["This is a query document about hawaii"], # Chroma will embed this for you
    n_results=2, # how many results to return
    include=["metadatas", "documents", "distances", "embeddings"]  # add "embeddings"
)

'embeddings': [[[-0.007089237216860056, 0.06553314626216888, ...
                 0.07775536179542542, 0.01559150218963623]]]

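But the client above keeps everything in memory, so the data is gone once the script finishes. To keep it, we need to define a location for persistence.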

import chromadb
import os

# Define the path for persistent storage
persist_directory = os.path.join(os.getcwd(), "chroma_db")

# Create a persistent Chroma client
chroma_client = chromadb.PersistentClient(path=persist_directory)

# Create the collection, or get it if it already exists
collection = chroma_client.get_or_create_collection(name="my_persistent_collection")

After running this, a new chroma_db folder appears.

$ tree -L 2

chroma_db
├── chroma.sqlite3
└── d54d3188-5a5a-4a40-b637-ebd47eb803e6
    ├── data_level0.bin
    ├── header.bin
    ├── length.bin
    └── link_lists.bin